## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The original dataset has 1599 observations of 13 variables. A variable (quality_cat) is added to categorize wine quality as Bad, Average or Good, corresponding to scores of up to 4, 5 to 7, and 8 and above respectively. Another variable (wine_quality) is added to store quality as a factor.
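The report does not show how the two variables were derived, so the following is only a minimal sketch of one way to do it, assuming `cut()` for the categories and `factor()` for the score. The toy vector holds the first ten quality scores from the `str()` output above; the break points and the label spelling "Good" follow the description in the text (the printed output spells that level "good").

```r
# Toy data: the first ten quality scores shown in the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> Bad, (4, 7] -> Average, (7, Inf] -> Good.
# Break points and labels are assumptions taken from the prose description.
quality_cat <- cut(quality,
                   breaks = c(-Inf, 4, 7, Inf),
                   labels = c("Bad", "Average", "Good"))

# factor() keeps every distinct score as its own level.
wine_quality <- factor(quality)

print(table(quality_cat))
print(levels(wine_quality))
```

On the full dataset the same call would produce the three-level and six-level factors listed in the second `str()` output.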
## [1] 1599 15
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality_cat : Factor w/ 3 levels "Bad","Average",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ wine_quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality_cat wine_quality
## Min. : 8.40 Min. :3.000 Bad : 63 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 Average:1518 4: 53
## Median :10.20 Median :6.000 good : 18 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
In the histogram of fixed.acidity there is a peak around 7. Changing the binwidth of the histogram changes the apparent shape of the distribution.
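The effect of the binwidth can be illustrated with a small sketch. It uses base R `hist()` with `plot = FALSE` to stay dependency-free (the plotting function used for the report is not shown), and the helper `bin_counts()` is a hypothetical name introduced here. The data are the first ten fixed.acidity values from the `str()` output.

```r
# First ten fixed.acidity values from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Hypothetical helper: bin counts for a given bin width.
bin_counts <- function(x, binwidth) {
  # Extend the upper end by one bin so every value is covered.
  breaks <- seq(floor(min(x)), ceiling(max(x)) + binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

wide   <- bin_counts(fixed_acidity, 1)    # few, tall bars
narrow <- bin_counts(fixed_acidity, 0.5)  # more, shorter bars

print(wide)
print(narrow)
```

With the wider bins almost all values fall into a single bar, while the narrower bins split the same values across neighbouring bars, which is why the peak appears to move or flatten.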
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid ranges from 0 to 1. The most common value is zero, and a second peak sits at about 0.48 to 0.49.
The median is 0.260 and the 3rd quartile is 0.420, while the maximum is 1.000,
which indicates the presence of outliers. The isolated bar at 1 in the plot is one such outlier.
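One common way to make "the summary indicates outliers" concrete is the 1.5 × IQR rule used by boxplots. The report does not say which rule was applied, so this is an assumption; the quartiles and maxima below are copied from the summary table above.

```r
# Upper boxplot fence: Q3 + 1.5 * IQR (assumed rule, not stated in the report).
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

# Quartiles and maxima taken from the summary() output above.
stats <- data.frame(
  variable = c("citric.acid", "residual.sugar", "chlorides"),
  q1  = c(0.090, 1.900, 0.070),
  q3  = c(0.420, 2.600, 0.090),
  max = c(1.000, 15.500, 0.611)
)

stats$fence <- upper_fence(stats$q1, stats$q3)
stats$max_beyond_fence <- stats$max > stats$fence
print(stats)
```

For all three variables the maximum lies beyond the fence, which agrees with reading the summaries as showing outliers.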
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The histogram of volatile.acidity shows two peaks, at about 0.4 and 0.6. The summary shows the presence of outliers: the maximum of 1.58 is far above the 3rd quartile of 0.64.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar has its highest peak at about 2. The median is 2.200 and the 3rd quartile is 2.600, but the maximum is 15.500, which indicates the presence of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides has a peak at about 0.08 to 0.09. Chloride values range from 0.012 to 0.611. The summary shows the presence of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The histogram of free.sulfur.dioxide has a peak at about 5. Some values lie beyond 60. The summary indicates the presence of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
In the bar plot of density, the lines have different shades. Where many data points share an x value, the line is darker; where only a few data points share an x value, the line is lighter. Also, the x-axis spans a very narrow range (0.9901 to 1.0037).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histogram of total.sulfur.dioxide has a peak around 20. In the summary, the mean (46.47) lies above the median (38) and the maximum (289) lies far above the 3rd quartile (62), which shows the presence of outliers. I was curious about the difference between total.sulfur.dioxide and free.sulfur.dioxide: free.sulfur.dioxide is the amount of the free form of SO2, whereas total.sulfur.dioxide is the amount of the free and bound forms of SO2 combined.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Most of the pH values fall between 3.0 and 3.5, and the full range is 2.74 to 4.01. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). The mean (3.311) and median (3.310) are very close, which suggests a roughly symmetric distribution, consistent with an approximately normal shape.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates has a peak at about 0.6. There are visible points far above the 3rd quartile of 0.73, up to the maximum of 2.0. The summary indicates the presence of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol has a peak at about 9.4. Alcohol ranges from 8.40 to 14.90. The summary indicates the presence of outliers.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The distribution of quality has its highest peak at 5 and its second highest at 6. Quality ranges from 3 (bad) to 8 (very good). Quality is the output variable.
## Bad Average good
## 63 1518 18
A variable (quality_cat) is added to categorize the quality of the wine as Bad, Average or Good. A wine is categorized as Bad when its quality is 4 or less, as Average when its quality is in the range 5 to 7, and as Good when its quality is 8 or above.
The distribution of quality_cat shows that most of the wines fall into the Average category. The summary shows the number of observations in each category: about 95% of the wines (1518 of 1599) are Average.
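The category shares can be checked directly from the counts printed above. This is a small sketch using `prop.table()`; the counts are copied from the output, and the level names are kept exactly as printed.

```r
# Counts per category, copied from the table above.
counts <- c(Bad = 63, Average = 1518, good = 18)

# Share of each category in percent, rounded to one decimal place.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```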
The red wine dataset consists of 1599 observations of 13 variables. The variables are an id column (X), fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality.
I would like to determine which chemical properties influence the quality of red wines. As per my observation, the main features in the data are alcohol and pH.
Other features such as citric.acid, density, fixed.acidity, volatile.acidity, sulphates, free.sulfur.dioxide, total.sulfur.dioxide and chlorides might support the investigation. These variables can be used to build a predictive model of the quality of red wines.
A variable (quality_cat) is added to categorize the quality of the wine as Bad, Average or Good, corresponding to scores of up to 4, 5 to 7, and 8 to 10 respectively. Another variable (wine_quality) is added to store quality as a factor.
The chosen dataset is already in tidy format, so there was no need to change the form of the data. pH has an approximately normal distribution, unlike most of the other variables. Log transformations were applied to some variables.
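The report does not say which variables were log-transformed or which base was used. As a sketch only, the following assumes `log10()` applied to residual.sugar (first ten values from the `str()` output) and uses a hypothetical helper, `tail_ratio()`, to compare how lopsided the values are before and after the transformation.

```r
# First ten residual.sugar values from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Assumed transformation: base-10 logarithm.
log_sugar <- log10(residual_sugar)

# Hypothetical helper: length of the upper tail relative to the lower tail,
# both measured from the median. Values near 1 indicate symmetry.
tail_ratio <- function(x) (max(x) - median(x)) / (median(x) - min(x))

print(tail_ratio(residual_sugar))
print(tail_ratio(log_sugar))
```

The ratio is smaller on the log scale, meaning the long right tail is compressed, which is the usual reason for log-transforming right-skewed variables.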
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.256130895 0.67170343 0.114776724
## volatile.acidity -0.25613089 1.000000000 -0.55249568 0.001917882
## citric.acid 0.67170343 -0.552495685 1.00000000 0.143577162
## residual.sugar 0.11477672 0.001917882 0.14357716 1.000000000
## chlorides 0.09370519 0.061297772 0.20382291 0.055609535
## free.so2 -0.15379419 -0.010503827 -0.06097813 0.187048995
## total.so2 -0.11318144 0.076470005 0.03553302 0.203027882
## density 0.66804729 0.022026232 0.36494718 0.355283371
## pH -0.68297819 0.234937294 -0.54190414 -0.085652422
## sulphates 0.18300566 -0.260986685 0.31277004 0.005527121
## alcohol -0.06166827 -0.202288027 0.10990325 0.042075437
## quality 0.12405165 -0.390557780 0.22637251 0.013731637
## chlorides free.so2 total.so2 density
## fixed.acidity 0.093705186 -0.153794193 -0.11318144 0.66804729
## volatile.acidity 0.061297772 -0.010503827 0.07647000 0.02202623
## citric.acid 0.203822914 -0.060978129 0.03553302 0.36494718
## residual.sugar 0.055609535 0.187048995 0.20302788 0.35528337
## chlorides 1.000000000 0.005562147 0.04740047 0.20063233
## free.so2 0.005562147 1.000000000 0.66766645 -0.02194583
## total.so2 0.047400468 0.667666450 1.00000000 0.07126948
## density 0.200632327 -0.021945831 0.07126948 1.00000000
## pH -0.265026131 0.070377499 -0.06649456 -0.34169933
## sulphates 0.371260481 0.051657572 0.04294684 0.14850641
## alcohol -0.221140545 -0.069408354 -0.20565394 -0.49617977
## quality -0.128906560 -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.so2 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.so2 -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
I used pairs.panels from the psych package to create a correlation plot. The markings at the top or bottom of each column show the value range of the variable in that column. For example, density has a minimum of 0.9901 and a maximum of 1.0037 in the summary, and in the plot its axis is marked from about 0.990 to 1.00. Setting fig.width and fig.height in the R chunk options greatly helped the readability of the plot.
## red_wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## red_wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red_wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red_wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red_wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red_wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
In the bar plot, most of the wines fall into the average quality range: wine_quality values of 5, 6 and 7 dominate. Alcohol has a stronger positive correlation with quality (0.476) than any other variable. The summaries show that wines of quality 8 have the highest median alcohol, 12.15.
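The header lines in the output above (`red_wine$quality: 3` and so on) are what `by()` prints, so the grouped summaries were most likely produced that way. A minimal sketch on toy data follows; the data frame holds only the first ten alcohol and quality values from the `str()` output, so its numbers differ from the full-data summaries.

```r
# Toy data: first ten rows of alcohol and quality from the str() output.
red_wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Six-number summary of alcohol within each quality score.
print(by(red_wine$alcohol, red_wine$quality, summary))

# Just the medians, as a named vector indexed by quality score.
medians <- tapply(red_wine$alcohol, red_wine$quality, median)
print(medians)
```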
Variables such as citric.acid, fixed.acidity, alcohol and sulphates show a positive correlation with quality_cat: as their values increase, quality_cat tends to increase as well.
## red_wine$quality_cat: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06850 0.08000 0.09573 0.09450 0.61000
## --------------------------------------------------------
## red_wine$quality_cat: Average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08735 0.09000 0.61100
## --------------------------------------------------------
## red_wine$quality_cat: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
The quality of the wine tends to decrease as pH, volatile.acidity and chlorides increase. In the chlorides vs quality_cat boxplots, the medians of the Bad and Average categories are very close (0.0800 and 0.0790). The summary of chlorides by quality_cat shows the median for each category.
## red_wine$quality_cat: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9966 0.9967 0.9977 1.0010
## --------------------------------------------------------
## red_wine$quality_cat: Average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9968 0.9979 1.0037
## --------------------------------------------------------
## red_wine$quality_cat: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
There is a mixed trend between quality_cat and variables such as density, residual.sugar, free.sulfur.dioxide and total.sulfur.dioxide: the median rises from Bad to Average and then drops from Average to Good.
The scatterplots of citric.acid vs pH, fixed.acidity vs pH and density vs pH
show a linear association, although there is still quite a bit of scatter around the pattern. A positive correlation value indicates a positive linear association and a negative value indicates a negative linear association. Consequently, correlation values of -0.54 (citric.acid vs pH), 0.67 (fixed.acidity vs both citric.acid and density) and -0.68 (fixed.acidity vs pH) are reasonably strong. Large correlations arise by chance more easily in small samples, so values of this size across 1599 observations are unlikely to be a sampling artifact.
## fixed.acidity citric.acid density pH
## fixed.acidity 1.0000000 0.6717034 0.6680473 -0.6829782
## citric.acid 0.6717034 1.0000000 0.3649472 -0.5419041
## density 0.6680473 0.3649472 1.0000000 -0.3416993
## pH -0.6829782 -0.5419041 -0.3416993 1.0000000
fixed.acidity has a linear association with density, citric.acid and pH. The scatterplots of these variables are shown above.
The scatterplot of residual.sugar vs citric.acid suffers from overplotting. Using geom_jitter reduces the overplotting.
The boxplots of free.sulfur.dioxide vs quality_cat and total.sulfur.dioxide vs quality_cat show the same type of distribution, but over different value ranges.
I was curious to see the relationship between free.sulfur.dioxide and total.sulfur.dioxide. The two have a strong relationship, with a correlation value of 0.668.
The correlation table shows the relationships among the variables. Quality has a positive relationship with fixed.acidity, citric.acid, sulphates, alcohol and (very weakly) residual.sugar, and a negative relationship with volatile.acidity, pH and chlorides. A relationship between two variables is said to be
- strong when the correlation value is between 0.5 and 1 or between -1 and -0.5
- moderate when the correlation value is between 0.3 and 0.5 or between -0.5 and -0.3

The alcohol - quality pair has a moderate relationship with a value of 0.476, and the volatile.acidity - quality pair has a moderate relationship with a value of -0.391. I was disappointed to find that the highest correlation with quality is only 0.476, for alcohol.
A strong relationship is seen in the pH - fixed.acidity (-0.6829782) and pH - citric.acid (-0.5419041) plots. fixed.acidity has a strong relationship with density, citric.acid and pH. The values are as follows:
citric.acid - fixed.acidity 0.6717034
density - fixed.acidity 0.6680473
pH - fixed.acidity -0.6829782
The strongest relationship is pH - fixed.acidity, with a correlation value of -0.683.
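The strength thresholds described above can be written as a small helper. This is a sketch: the function name `strength()` and the label "weak" for values below 0.3 are assumptions added here (the text only defines "strong" and "moderate"). The correlation values are copied from the matrix above and rounded to three decimals.

```r
# Classify a correlation by its absolute value.
# "weak" (below 0.3) is an assumed label; the text defines only the other two.
strength <- function(r) {
  a <- abs(r)
  ifelse(a >= 0.5, "strong", ifelse(a >= 0.3, "moderate", "weak"))
}

# Correlations copied from the matrix above.
pairs <- c(
  "alcohol - quality"          =  0.476,
  "volatile.acidity - quality" = -0.391,
  "pH - fixed.acidity"         = -0.683,
  "pH - citric.acid"           = -0.542,
  "free.so2 - total.so2"       =  0.668,
  "residual.sugar - quality"   =  0.014
)

print(strength(pairs))
```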
Multivariate plotting starts with the strongest pair, pH - fixed.acidity, coloured by quality_cat. The scatterplot shows that most of the data points come from average quality wines.
The plot of citric.acid vs pH by quality_cat also has most of its values in the average quality category. citric.acid and pH have a correlation value of -0.542.
The plot of citric.acid vs fixed.acidity by quality shows a strong relationship, with a correlation value of 0.67.
The plot of pH vs fixed.acidity, with citric.acid as colour and facet_wrap on quality_cat, shows four variables at once. When I used pH as a factor, I got a huge list of values in different colours (one level per distinct value), so I used the numeric pH values instead. I was very curious to explore this combination because of the strong relationships among pH, fixed.acidity and citric.acid.
Density vs pH and density vs citric.acid, by wine quality, show a moderate relationship.
Quality has a stronger relationship with alcohol than with any other variable, so I decided to add one more variable at a time (density, volatile.acidity, sulphates) and look for interesting plots. Most of the data points come from average quality wines (quality values 5, 6 and 7).
In the plot of pH vs fixed.acidity with citric.acid as colour and facet_wrap on quality_cat, the strong relationships among pH, fixed.acidity and
citric.acid are visible. Quality has a stronger linear association with alcohol than with any other variable, although that association is still not strong. Its second-strongest relationship is with volatile.acidity.
The alcohol - density - quality plot shows a surprising relationship: alcohol and density have a correlation value of -0.496. Alcohol has a stronger relationship with density than with any other variable.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red_wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar,
## data = red_wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## fixed.acidity + chlorides, data = red_wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## fixed.acidity + chlorides + sulphates + pH, data = red_wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## fixed.acidity + chlorides + sulphates + pH + citric.acid +
## density, data = red_wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## fixed.acidity + chlorides + sulphates + pH + citric.acid +
## density + free.sulfur.dioxide + total.sulfur.dioxide, data = red_wine)
##
## ==============================================================================================
##                              m1          m2          m3          m4          m5          m6
## ----------------------------------------------------------------------------------------------
## (Intercept)               1.875***    3.099***    2.748***    3.656***   27.461      21.965
##                          (0.175)     (0.186)     (0.225)     (0.594)    (21.247)    (21.195)
## alcohol                   0.361***    0.314***    0.317***    0.305***    0.290***    0.276***
##                          (0.017)     (0.016)     (0.016)     (0.017)     (0.026)     (0.026)
## volatile.acidity                     -1.383***   -1.279***   -1.061***   -1.195***   -1.084***
##                                      (0.095)     (0.099)     (0.101)     (0.119)     (0.121)
## residual.sugar                       -0.002      -0.006      -0.003       0.010       0.016
##                                      (0.012)     (0.012)     (0.012)     (0.015)     (0.015)
## fixed.acidity                                     0.038***    0.011       0.054*      0.025
##                                                  (0.010)     (0.013)     (0.025)     (0.026)
## chlorides                                        -0.445      -1.882***   -1.586***   -1.874***
##                                                  (0.365)     (0.404)     (0.417)     (0.419)
## sulphates                                                     0.839***    0.885***    0.916***
##                                                              (0.111)     (0.114)     (0.114)
## pH                                                           -0.335*     -0.250      -0.414*
##                                                              (0.155)     (0.189)     (0.192)
## citric.acid                                                              -0.361*     -0.183
##                                                                          (0.143)     (0.147)
## density                                                                 -24.284     -17.881
##                                                                         (21.678)    (21.633)
## free.sulfur.dioxide                                                                   0.004*
##                                                                                      (0.002)
## total.sulfur.dioxide                                                                 -0.003***
##                                                                                      (0.001)
## ----------------------------------------------------------------------------------------------
## R-squared                 0.227       0.317       0.323       0.349       0.352       0.361
## adj. R-squared            0.226       0.316       0.321       0.346       0.348       0.356
## sigma                     0.710       0.668       0.665       0.653       0.652       0.648
## F                       468.267     246.776     152.191     121.641      95.823      81.348
## p                         0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood        -1721.057   -1621.802   -1614.448   -1583.925   -1580.005   -1569.138
## Deviance                805.870     711.786     705.269     678.850     675.530     666.411
## AIC                    3448.114    3253.605    3242.896    3185.850    3182.010    3164.277
## BIC                    3464.245    3280.491    3280.536    3234.244    3241.158    3234.179
## N                          1599        1599        1599        1599        1599        1599
## ==============================================================================================
I built linear models for quality, adding the chemical properties step by step until all of them were included (m6). R-squared ranges from 0.227 (m1, alcohol only) to 0.361 (m6, all variables). R-squared is the proportion of the variance in quality that the predictors explain, so even the full model accounts for only about 36% of it.
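As a cross-check on the table, the adjusted R-squared can be recomputed from the reported R-squared, the sample size and the number of predictors using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). This sketch uses only values printed in the table; because the R-squared inputs are already rounded to three decimals, agreement is expected only to within about 0.001.

```r
# Standard adjusted R-squared formula.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 1599                                   # observations (row N of the table)
r2 <- c(m1 = 0.227, m2 = 0.317, m3 = 0.323,  # R-squared row of the table
        m4 = 0.349, m5 = 0.352, m6 = 0.361)
p  <- c(1, 3, 5, 7, 9, 11)                   # predictors in each model formula

recomputed <- adj_r2(r2, n, p)
reported   <- c(0.226, 0.316, 0.321, 0.346, 0.348, 0.356)

print(round(recomputed, 4))
print(max(abs(recomputed - reported)))       # largest discrepancy
```

The small penalty for extra predictors shows that the gain from m1 to m6 is real but modest.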
This model has a limitation. The red wine dataset does not include wines with quality below 3 or above 8. Adding data for such wines could improve the linear model.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The distribution of quality has its highest peak at 5 and its second highest at 6. Quality ranges from 3 (bad) to 8 (very good) and is the output variable. The summary shows how the 1599 observations are distributed across the quality scores: 681 observations have a quality of 5 and 638 have a quality of 6.
## red_wine$wine_quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## red_wine$wine_quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red_wine$wine_quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red_wine$wine_quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red_wine$wine_quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red_wine$wine_quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
In the boxplot of alcohol percentage by wine quality, wines of good quality have a higher alcohol content (three quarters of them above 11.3). The medians increase gradually from bad through average to good quality wines. However, bad and average quality wines share almost the same alcohol range. As you can see, the average quality wine has a lower first quartile than the bad quality wine. This plot gives the impression that alcohol plays an important role in determining wine quality.
In the bar plot, most of the wines fall into the average quality range: wine_quality values of 5, 6 and 7 dominate. Alcohol has a stronger positive correlation with quality than any other variable.
In the summaries of alcohol by quality, wines of quality 8 have the highest median, 12.15, and wines of quality 5 have the lowest median, 9.7.
## fixed.acidity citric.acid quality
## fixed.acidity 1.0000000 0.6717034 0.1240516
## citric.acid 0.6717034 1.0000000 0.2263725
## quality 0.1240516 0.2263725 1.0000000
This scatterplot shows the distribution of fixed.acidity and citric.acid by wine quality. citric.acid can add 'freshness' and flavor to wines. Most acids involved with wine do not evaporate readily. Chemically, the acids influence titratable acidity, which affects the taste of the wine. As you can see, there is a positive correlation between citric.acid and fixed.acidity. When these two variables increase, the taste and quality of the wine are affected.
The correlation values of these variables show that
- fixed.acidity and citric.acid have a correlation value of 0.672
- quality has a higher correlation with citric.acid (0.226) than with fixed.acidity (0.124), possibly because citric.acid can add freshness and flavor to wines, thus improving the quality
- when fixed.acidity and citric.acid increase, the titratable acidity increases, which affects the taste and quality of the wine
I had trouble with the correlation matrix plot. I first used ggpairs to create it. The plot was fine, but the correlation values were not fully visible in some cells, and variable names such as free.sulfur.dioxide and total.sulfur.dioxide did not fit within their cells. I tried different ways to solve this issue without success, and submitted the project with the issue unresolved.
In the review, it was suggested that I use fig.width and fig.height. I applied them to the correlation plot and they did help: the correlation values became readable and uniform in size. However, nearly 30% of the matrix was then not visible, so I decided to look for other options.
I found chart.Correlation in the PerformanceAnalytics package. It gave a pretty matrix, but the correlation values were printed in different sizes: strong correlations in a readable font size and weak correlations in a small one.
Then I noticed the correlation plot (made with the psych package) in the example project of the project description. I renamed free.sulfur.dioxide and total.sulfur.dioxide to free.so2 and total.so2 respectively. fig.width and fig.height greatly helped in restoring the size, and the plot looks good now.
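The renaming step can be sketched as follows. The toy data frame holds the first five values of three columns from the `str()` output, and `cor()` stands in for the plot so that the example runs without extra packages. The `pairs.panels()` call is left commented out because it needs the psych package, and any fig.width and fig.height values would be set in the chunk header rather than in the code.

```r
# Toy data: first five values of three columns from the str() output.
red_wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51)
)

# Shorten the two long names so they fit inside the plot cells.
names(red_wine)[names(red_wine) == "free.sulfur.dioxide"]  <- "free.so2"
names(red_wine)[names(red_wine) == "total.sulfur.dioxide"] <- "total.so2"

# Correlation matrix with the shortened labels.
cor_matrix <- cor(red_wine)
print(round(cor_matrix, 3))

# With the psych package installed, the plot itself would be drawn with:
# psych::pairs.panels(red_wine)
```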
The dataset lacks wines with quality values under 3 and over 8. Such wines would have affected the results for the output variable. With a dataset covering the full quality range from 0 to 10, the results would likely be more precise than the current ones.
There is no data about grape type, wine brand, selling price and so on. Grape types differ from one another in sweetness, chemical properties and more. This information would have helped in understanding the chemical properties better, and hence their influence on wine quality.
Each wine brand has its own way of making wine, or follows the standard process with a twist on some or all of the steps. Any change in chemical properties caused by the wine-making process might therefore have influenced the wine quality.
Weather also plays an important role in the final wine quality. A year with good weather produces better-tasting, healthier grapes, which influences the wine quality.
Moreover, more sophisticated prediction models should be able to provide more accurate predictions of wine quality based on its chemical characteristics.